Using Automated Error Profiling of Texts for Improved Selection of Correction Candidates for Garbled Tokens
Internal identifier: 000D49 (Main/Exploration); previous: 000D48; next: 000D50
Authors: Stoyan Mihov; Petar Mitankin; Annette Gotscharek [Germany]; Ulrich Reffle [Germany]; U. Schulz [Germany]; Christoph Ringlstetter
Source:
- Lecture Notes in Computer Science [ 0302-9743 ] ; 2007.
Abstract: Lexical text correction systems are typically based on a central step: when finding a malformed token in the input text, a set of correction candidates for the token is retrieved from the given background dictionary. In previous work we introduced a method for the selection of correction candidates which is fast and leads to small candidate sets with high recall. As a prerequisite, ground truth data were used to find a set of important substitutions, merges and splits that represent characteristic errors found in the text. This prior knowledge was then used to fine-tune the meaningful selection of correction candidates. Here we show that an appropriate set of possible substitutions, merges and splits for the input text can be retrieved without any ground truth data. In the new approach, we compute an error profile of the erroneous input text in a fully automated way, using so-called error dictionaries. From this profile, suitable sets of substitutions, merges and splits are derived. Error profiling with error dictionaries is simple and very fast. As an overall result we obtain an adaptive form of candidate selection which is very efficient, does not need ground truth data and leads to small candidate sets with high recall.
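The abstract does not spell out how the error dictionaries are built or how the profile is turned into sets of substitutions, merges and splits. The following is only a minimal sketch of the general idea, under stated assumptions: the error dictionary is built by applying each candidate pattern to each lexicon word, the profile is a simple count of patterns that explain out-of-lexicon tokens, and patterns are kept when they reach a fixed count. All names (`build_error_dictionary`, `profile_errors`, `select_patterns`, `min_count`) and the toy lexicon are hypothetical and are not taken from the paper.

```python
from collections import Counter


def build_error_dictionary(lexicon, patterns):
    """Map each garbled string to the (correct word, pattern) pairs producing it.

    Assumption: a pattern is a pair (src, dst) meaning "src was misread as dst".
    One character to one is a substitution, two to one a merge, one to two a split.
    """
    lexicon = set(lexicon)
    error_dict = {}
    for word in lexicon:
        for src, dst in patterns:
            start = word.find(src)
            while start != -1:
                garbled = word[:start] + dst + word[start + len(src):]
                # Only strings that are not real words count as errors.
                if garbled not in lexicon:
                    error_dict.setdefault(garbled, set()).add((word, (src, dst)))
                start = word.find(src, start + 1)
    return error_dict


def profile_errors(tokens, lexicon, error_dict):
    """Count how often each pattern explains an out-of-lexicon token."""
    lexicon = set(lexicon)
    profile = Counter()
    for token in tokens:
        if token in lexicon:
            continue
        for _word, pattern in error_dict.get(token, ()):
            profile[pattern] += 1
    return profile


def select_patterns(profile, min_count=2):
    """Keep the patterns seen at least min_count times (hypothetical threshold)."""
    return {pattern for pattern, count in profile.items() if count >= min_count}


# Toy data, invented for illustration only.
lexicon = {"modern", "corner", "time", "line", "mine"}
patterns = [("rn", "m"), ("m", "rn"), ("i", "l"), ("e", "c")]
tokens = ["modem", "comer", "tirne", "rnine", "timc", "line"]

error_dict = build_error_dictionary(lexicon, patterns)
profile = profile_errors(tokens, lexicon, error_dict)
selected = select_patterns(profile)
print(sorted(selected))
```

In this toy run the merge `rn` to `m` and the split `m` to `rn` each explain two tokens and are kept, while the substitution `e` to `c` explains only one token and is dropped. No ground truth is consulted at any point, which is the property the abstract emphasizes.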
DOI: 10.1007/978-3-540-76928-6_47
Affiliations:
Links toward previous steps (curation, corpus...)
- to stream Istex, to step Corpus: 000B68
- to stream Istex, to step Curation: 000B53
- to stream Istex, to step Checkpoint: 000764
- to stream Main, to step Merge: 000D62
- to stream Main, to step Curation: 000D49
The document in XML format
<record><TEI wicri:istexFullTextTei="biblStruct:series"><teiHeader><fileDesc><titleStmt><title xml:lang="en">Using Automated Error Profiling of Texts for Improved Selection of Correction Candidates for Garbled Tokens</title>
<author><name sortKey="Mihov, Stoyan" sort="Mihov, Stoyan" uniqKey="Mihov S" first="Stoyan" last="Mihov">Stoyan Mihov</name>
</author>
<author><name sortKey="Mitankin, Petar" sort="Mitankin, Petar" uniqKey="Mitankin P" first="Petar" last="Mitankin">Petar Mitankin</name>
</author>
<author><name sortKey="Gotscharek, Annette" sort="Gotscharek, Annette" uniqKey="Gotscharek A" first="Annette" last="Gotscharek">Annette Gotscharek</name>
</author>
<author><name sortKey="Reffle, Ulrich" sort="Reffle, Ulrich" uniqKey="Reffle U" first="Ulrich" last="Reffle">Ulrich Reffle</name>
</author>
<author><name sortKey="Schulz, U" sort="Schulz, U" uniqKey="Schulz U" first="U." last="Schulz">U. Schulz</name>
</author>
<author><name sortKey="Ringlstetter, Christoph" sort="Ringlstetter, Christoph" uniqKey="Ringlstetter C" first="Christoph" last="Ringlstetter">Christoph Ringlstetter</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:EC49BC3BD280658ED40AD1041F0C4EEFD8C8CDE5</idno>
<date when="2007" year="2007">2007</date>
<idno type="doi">10.1007/978-3-540-76928-6_47</idno>
<idno type="url">https://api.istex.fr/document/EC49BC3BD280658ED40AD1041F0C4EEFD8C8CDE5/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000B68</idno>
<idno type="wicri:Area/Istex/Curation">000B53</idno>
<idno type="wicri:Area/Istex/Checkpoint">000764</idno>
<idno type="wicri:doubleKey">0302-9743:2007:Mihov S:using:automated:error</idno>
<idno type="wicri:Area/Main/Merge">000D62</idno>
<idno type="wicri:Area/Main/Curation">000D49</idno>
<idno type="wicri:Area/Main/Exploration">000D49</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a" type="main" xml:lang="en">Using Automated Error Profiling of Texts for Improved Selection of Correction Candidates for Garbled Tokens</title>
<author><name sortKey="Mihov, Stoyan" sort="Mihov, Stoyan" uniqKey="Mihov S" first="Stoyan" last="Mihov">Stoyan Mihov</name>
<affiliation><wicri:noCountry code="subField">Sciences</wicri:noCountry>
</affiliation>
</author>
<author><name sortKey="Mitankin, Petar" sort="Mitankin, Petar" uniqKey="Mitankin P" first="Petar" last="Mitankin">Petar Mitankin</name>
<affiliation><wicri:noCountry code="subField">Sciences</wicri:noCountry>
</affiliation>
</author>
<author><name sortKey="Gotscharek, Annette" sort="Gotscharek, Annette" uniqKey="Gotscharek A" first="Annette" last="Gotscharek">Annette Gotscharek</name>
<affiliation wicri:level="4"><country>Allemagne</country>
<placeName><settlement type="city">Munich</settlement>
<region type="land" nuts="1">Bavière</region>
<region type="district" nuts="2">District de Haute-Bavière</region>
</placeName>
<orgName type="university">Université Louis-et-Maximilien de Munich</orgName>
</affiliation>
</author>
<author><name sortKey="Reffle, Ulrich" sort="Reffle, Ulrich" uniqKey="Reffle U" first="Ulrich" last="Reffle">Ulrich Reffle</name>
<affiliation wicri:level="4"><country>Allemagne</country>
<placeName><settlement type="city">Munich</settlement>
<region type="land" nuts="1">Bavière</region>
<region type="district" nuts="2">District de Haute-Bavière</region>
</placeName>
<orgName type="university">Université Louis-et-Maximilien de Munich</orgName>
</affiliation>
</author>
<author><name sortKey="Schulz, U" sort="Schulz, U" uniqKey="Schulz U" first="U." last="Schulz">U. Schulz</name>
<affiliation wicri:level="4"><country>Allemagne</country>
<placeName><settlement type="city">Munich</settlement>
<region type="land" nuts="1">Bavière</region>
<region type="district" nuts="2">District de Haute-Bavière</region>
</placeName>
<orgName type="university">Université Louis-et-Maximilien de Munich</orgName>
</affiliation>
</author>
<author><name sortKey="Ringlstetter, Christoph" sort="Ringlstetter, Christoph" uniqKey="Ringlstetter C" first="Christoph" last="Ringlstetter">Christoph Ringlstetter</name>
<affiliation><wicri:noCountry code="subField">Alberta</wicri:noCountry>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="s">Lecture Notes in Computer Science</title>
<imprint><date>2007</date>
</imprint>
<idno type="ISSN">0302-9743</idno>
<idno type="eISSN">1611-3349</idno>
<idno type="ISSN">0302-9743</idno>
</series>
<idno type="istex">EC49BC3BD280658ED40AD1041F0C4EEFD8C8CDE5</idno>
<idno type="DOI">10.1007/978-3-540-76928-6_47</idno>
<idno type="ChapterID">47</idno>
<idno type="ChapterID">Chap47</idno>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass></textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Abstract: Lexical text correction systems are typically based on a central step: when finding a malformed token in the input text, a set of correction candidates for the token is retrieved from the given background dictionary. In previous work we introduced a method for the selection of correction candidates which is fast and leads to small candidate sets with high recall. As a prerequisite, ground truth data were used to find a set of important substitutions, merges and splits that represent characteristic errors found in the text. This prior knowledge was then used to fine-tune the meaningful selection of correction candidates. Here we show that an appropriate set of possible substitutions, merges and splits for the input text can be retrieved without any ground truth data. In the new approach, we compute an error profile of the erroneous input text in a fully automated way, using so-called error dictionaries. From this profile, suitable sets of substitutions, merges and splits are derived. Error profiling with error dictionaries is simple and very fast. As an overall result we obtain an adaptive form of candidate selection which is very efficient, does not need ground truth data and leads to small candidate sets with high recall.</div>
</front>
</TEI>
<affiliations><list><country><li>Allemagne</li>
</country>
<region><li>Bavière</li>
<li>District de Haute-Bavière</li>
</region>
<settlement><li>Munich</li>
</settlement>
<orgName><li>Université Louis-et-Maximilien de Munich</li>
</orgName>
</list>
<tree><noCountry><name sortKey="Mihov, Stoyan" sort="Mihov, Stoyan" uniqKey="Mihov S" first="Stoyan" last="Mihov">Stoyan Mihov</name>
<name sortKey="Mitankin, Petar" sort="Mitankin, Petar" uniqKey="Mitankin P" first="Petar" last="Mitankin">Petar Mitankin</name>
<name sortKey="Ringlstetter, Christoph" sort="Ringlstetter, Christoph" uniqKey="Ringlstetter C" first="Christoph" last="Ringlstetter">Christoph Ringlstetter</name>
</noCountry>
<country name="Allemagne"><region name="Bavière"><name sortKey="Gotscharek, Annette" sort="Gotscharek, Annette" uniqKey="Gotscharek A" first="Annette" last="Gotscharek">Annette Gotscharek</name>
</region>
<name sortKey="Reffle, Ulrich" sort="Reffle, Ulrich" uniqKey="Reffle U" first="Ulrich" last="Reffle">Ulrich Reffle</name>
<name sortKey="Schulz, U" sort="Schulz, U" uniqKey="Schulz U" first="U." last="Schulz">U. Schulz</name>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000D49 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000D49 | SxmlIndent | more
To link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Main |étape= Exploration |type= RBID |clé= ISTEX:EC49BC3BD280658ED40AD1041F0C4EEFD8C8CDE5 |texte= Using Automated Error Profiling of Texts for Improved Selection of Correction Candidates for Garbled Tokens }}
This area was generated with Dilib version V0.6.32.